Doggos in NYC

According to PetCareRX there are roughly 600,000 dogs and 500,000 cats in NYC!!!

About the data:

  • This data was taken from NYC Open Data: https://data.cityofnewyork.us/Health/NYC-Dog-Licensing-Dataset/nu7n-tubp
  • Consists of dog license information for NYC

Objective:

  • My goal is to analyze the pet dog population of NYC.
  • Analyze dog breeds and their popularity across different neighbourhoods.
My purpose in working on this project is to develop my data cleaning, exploration, explanation, and visualization skills on real-world data. I also wish to build my proficiency in different data visualization libraries in Python.
In [3]:
#import the necessary libraries

import numpy as np
import pandas as pd
import datetime as dt
import seaborn as sns
import regex as re
import matplotlib.pyplot as plt
from matplotlib.cbook import get_sample_data
from matplotlib.offsetbox import (OffsetImage, AnnotationBbox)
from palettable.colorbrewer.qualitative import Pastel1_7,Dark2_7
import plotly.express as px

Load the dataset

We will load the dataset and perform an initial Exploratory Data Analysis to get a broader picture of the data. We will also perform some data cleaning tasks.

In [29]:
data = pd.read_csv('input/NYC_Dog_Licensing_Dataset.csv') #data for original dataset
data.head(20)
Out[29]:
AnimalName AnimalGender AnimalBirthYear BreedName ZipCode LicenseIssuedDate LicenseExpiredDate Extract Year
0 PAIGE F 2014 American Pit Bull Mix / Pit Bull Mix 10035.0 09/12/2014 09/12/2017 2016
1 YOGI M 2010 Boxer 10465.0 09/12/2014 10/02/2017 2016
2 ALI M 2014 Basenji 10013.0 09/12/2014 09/12/2019 2016
3 QUEEN F 2013 Akita Crossbreed 10013.0 09/12/2014 09/12/2017 2016
4 LOLA F 2009 Maltese 10028.0 09/12/2014 10/09/2017 2016
5 IAN M 2006 Unknown 10013.0 09/12/2014 10/30/2019 2016
6 BUDDY M 2008 Unknown 10025.0 09/12/2014 10/20/2017 2016
7 CHEWBACCA F 2012 Labrador Retriever Crossbreed 10013.0 09/12/2014 10/01/2019 2016
8 HEIDI-BO F 2007 Dachshund Smooth Coat 11215.0 09/13/2014 04/16/2017 2016
9 MASSIMO M 2009 Bull Dog, French 11201.0 09/13/2014 09/17/2017 2016
10 LOLA F 2006 Miniature Pinscher 10022.0 09/13/2014 10/03/2019 2016
11 LEMMY F 2005 Yorkshire Terrier 10003.0 09/13/2014 10/26/2017 2016
12 LUCY F 2014 Dachshund Smooth Coat Miniature 11215.0 09/13/2014 09/13/2019 2016
13 RICKY M 2014 German Shepherd Dog 11220.0 09/13/2014 09/13/2017 2016
14 SARAH F 2012 Unknown 10040.0 09/13/2014 09/13/2017 2016
15 MURPHY M 2012 American Pit Bull Mix / Pit Bull Mix 10463.0 09/13/2014 09/28/2017 2016
16 JUNE F 2010 Cavalier King Charles Spaniel 11238.0 09/13/2014 10/28/2019 2016
17 ELIZABETH F 2013 Cavalier King Charles Spaniel 10022.0 09/13/2014 09/13/2019 2016
18 AVERY F 2014 American Pit Bull Terrier/Pit Bull 10002.0 09/13/2014 09/13/2019 2016
19 SOPHIE F 2011 Boxer 10308.0 09/13/2014 10/23/2019 2016

Understanding the columns:

    Column Name             Column Description
•   AnimalName              User-provided dog name (unless specified otherwise)
•   AnimalGender            M (Male) or F (Female) dog gender
•   AnimalBirthYear         Year dog was born
•   BreedName               Dog breed
•   ZipCode                 Owner zip code
•   LicenseIssuedDate       Date the dog license was issued
•   LicenseExpiredDate      Date the dog license expires
•   Extract Year            Year the data was extracted

Exploratory Data Analysis

Before making any assumptions, we would like to have an overview of the data. In the process, we will identify obvious errors, better understand patterns within the data, detect outliers or anomalous events, and find interesting relations among the variables.

In [35]:
#dimensions of the dataframe
data.shape
Out[35]:
(508196, 8)
In [36]:
#change all column headers to lowercase
data.columns = data.columns.str.lower()

Looks good!

In [38]:
#exploring the animal gender column
data.animalgender.unique()
Out[38]:
array(['F', 'M', nan], dtype=object)

The output shows three values: 'F', 'M' and nan. Missing values (and any blank ' ' entries) are of no use to us. Hence, for analysis purposes, we will get rid of those rows.

In [41]:
#replace empty values with N/A to delete them later
data.animalgender = data.animalgender.replace({' ':"N/A"}).fillna("N/A")

#count number of unknown rows
len(data[data["animalgender"] == "N/A"])
Out[41]:
21
In [42]:
#remove the unknown rows
data = data.loc[(data["animalgender"] != "N/A")]
data.animalgender.unique()
Out[42]:
array(['F', 'M'], dtype=object)
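The two-step cleanup above (tag with "N/A", then filter) can also be written as a single filter. This is a minimal sketch on a made-up frame; `toy` and its values are hypothetical and are not rows from the licensing data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for the licensing data
toy = pd.DataFrame({"animalgender": ["F", "M", np.nan, " ", "M"]})

# Keep only rows whose gender is one of the two valid codes;
# NaN and blank strings fail the isin() test and are dropped
clean = toy[toy["animalgender"].isin(["F", "M"])]

print(clean["animalgender"].unique())
print(len(toy) - len(clean), "row(s) removed")
```

Filtering on the values we want to keep, rather than on the ones we want to drop, also guards against other unexpected codes.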
In [43]:
#total unique values in zipcode column
data.zipcode.nunique()
Out[43]:
784
  • We can see that many of these zip codes lie outside NYC, which consists of five boroughs: Manhattan, Brooklyn, Queens, Bronx and Staten Island. Not just that, the casing of borough names is not consistent.
  • We are only interested in NYC. Hence, we will refer to another table that maps NYC boroughs to their zip codes.
In [47]:
zipcodes = pd.read_csv('input/NYC_Borough_Zipcodes.csv') #NYC Borough Zipcodes

#keep only boroughs that belong to NYC by their zipcodes
data = data.merge(zipcodes, on='zipcode', how='inner')
data.shape
Out[47]:
(490082, 11)
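The inner merge silently drops every license whose zip code is missing from the lookup table: the row count falls from 508,175 (after the gender cleanup) to 490,082. One way to inspect what is being dropped is a left merge with `indicator=True`. The sketch below uses hypothetical stand-in frames (`dogs`, `lookup`), not the real files.

```python
import pandas as pd

# Hypothetical stand-ins for the license data and the borough lookup table
dogs = pd.DataFrame({"animalname": ["PAIGE", "YOGI", "REX"],
                     "zipcode": [10035, 10465, 7030]})
lookup = pd.DataFrame({"zipcode": [10035, 10465],
                       "borough": ["Manhattan", "Bronx"]})

# A left merge keeps every dog; the _merge column says whether a borough was found
checked = dogs.merge(lookup, on="zipcode", how="left", indicator=True)
outside_nyc = checked[checked["_merge"] == "left_only"]

print(len(outside_nyc), "row(s) would be dropped by the inner merge")
print(outside_nyc[["animalname", "zipcode"]])
```

Looking at the unmatched zip codes before discarding them helps confirm that they really are outside NYC and not typos or gaps in the lookup table.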
In [48]:
#convert all to lower case
# data.borough = data.borough.str.lower()

#remove leading and trailing spaces
data.borough = data.borough.replace(r"^ +| +$", r"", regex=True)

data.borough.unique()
Out[48]:
array(['Manhattan', 'Bronx', 'Brooklyn', 'Staten Island', 'Queens'],
      dtype=object)
In [51]:
#create and plot a cross table for gender vs borough

crosstb = pd.crosstab(data.borough,data.animalgender)
crosstb
Out[51]:
animalgender F M
borough
Bronx 22865 29671
Brooklyn 61429 73070
Manhattan 73955 84087
Queens 43895 56637
Staten Island 20553 23920

Visualization 1 - Donut Chart & Bar Chart

  • Now, let's try to visualize this data to get some valuable information.
  • We will be using a donut chart in matplotlib to show the number of Doggos in NYC boroughs and a bar chart in seaborn to show the gender distribution.
In [54]:
#1. donut chart Doggos in NYC Boroughs

my_circle = plt.Circle( (0,0), 0.7, color='white')

#number of doggos in nyc boroughs
borough_counts = data.groupby(['borough'])['borough'].count()
borough_groupby_count = pd.Series(borough_counts.values)
#take the labels from the groupby index (sorted alphabetically) so each label stays aligned with its count
nyc_boroughs = pd.Series(borough_counts.index.values)
boroughs_count=pd.concat([nyc_boroughs,borough_groupby_count],axis=1)

#plot
plt.figure(figsize=(6,6))
plt.pie(boroughs_count.loc[:,1], labels=boroughs_count.loc[:,0]+ ': '+boroughs_count.loc[:,1].astype(str), colors=Pastel1_7.hex_colors)
p = plt.gcf()
p.gca().add_artist(my_circle)
#add text
plt.text(1.5,0, 'Manhattan has the most Doggos in NYC', fontsize = 22, bbox = dict(facecolor = 'red', alpha = 0.5))

#string at the center of the donut
sumstr = str(round(np.sum(borough_groupby_count)/1000,2)) + 'K'
plt.text(0., 0., sumstr, horizontalalignment='center', verticalalignment='center',fontsize = 32)
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.


#2. barchart for Genders in NYC Boroughs

#control aesthetics
#change theme - gridlines, axes & size (set before plotting so it applies to this figure)
sns.set_theme()
sns.set_context("talk",font_scale = .8)

plt.figure(figsize=(15,8))
#stack and reset
stacked = crosstb.stack().reset_index().rename(columns={0:'value'})

# plot grouped bar chart
p = sns.barplot(x=stacked.borough, y=stacked.value, hue=stacked.animalgender, palette = 'PiYG')

#adjust legend and axis labels
#move_legend also sets the title; a later plt.legend() call would rebuild the legend and undo the move
sns.move_legend(p, loc='upper left', bbox_to_anchor=(1, 1.02), title="Gender")
p.set(xlabel="Boroughs",ylabel="Number of Doggos",title="Dog Genders in NYC")
# Remove borders
p.spines['top'].set_visible(False)
p.spines['right'].set_visible(False)
p.spines['bottom'].set_visible(False)
p.spines['left'].set_visible(False)

Inferences:

  • We can infer from the visualizations that Manhattan, among all five NYC boroughs, has the highest number of Doggos.
  • In each borough, there are more male Doggos than female ones.
    • However, the difference is not drastic.
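To put a number on "not drastic", the counts can be turned into shares per borough. The sketch below rebuilds the crosstab from the figures printed in Out[51]; in the notebook itself the existing `crosstb` could be used directly.

```python
import pandas as pd

# Counts copied from the crosstab output above
crosstb = pd.DataFrame(
    {"F": [22865, 61429, 73955, 43895, 20553],
     "M": [29671, 73070, 84087, 56637, 23920]},
    index=["Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island"],
)

# Divide each row by its total to get the gender share per borough
shares = crosstb.div(crosstb.sum(axis=1), axis=0)

print(shares.round(3))
```

With these counts the male share lies between roughly 53% and 57% in every borough, with the Bronx at the upper end.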

Visualization 2 - Horizontal Bar Chart & Lollipop Chart

We would like to find the most popular Doggo breed in NYC. Both of these visualizations can be used to answer this question.

Visualization 2.1 Horizontal bar chart

In [84]:
#1. bar chart

#get the count of breed names, sorted in descending order
#value_counts() counts and sorts in one step, so names and counts cannot get out of sync
breeds_count = (data['breedname'].value_counts()
                .rename_axis('breedname')
                .reset_index(name='count'))

#label the ones that have count less than 100
breeds_count.loc[breeds_count['count'] < 100, 'breedname'] = 'Other'
breeds_count = breeds_count.groupby(['breedname'],as_index =False).sum().sort_values(by='count',ascending=False)


#control aesthetics

#change theme - gridlines, axes & size (set before plotting so it applies to this figure)
sns.set_theme(style="whitegrid")
sns.set_context("talk",font_scale = .8)

plt.figure(figsize=(15,10))

p2 = sns.barplot(y="breedname", x="count", data = breeds_count.head(20),palette='Pastel2')

#adjust axis labels
p2.set(xlabel="Number of Doggos",ylabel="Breed Name",title="Most Popular Doggo Breed in NYC")


# Remove borders
p2.spines[['top','bottom','left','right']].set_visible(False)

Visualization 2.2 Horizontal lollipop chart (a fancier version of a horizontal bar chart)

In [85]:
#horizontal lollipop chart

# Reorder it based on the values
# ordered_df = df.sort_values(by='values')
my_range=range(1,len(breeds_count.head(20).index)+1)
plt.figure(figsize=(15,10))
 
# The horizontal lines are drawn using the hlines function

plt.hlines(y=my_range, xmin=0, xmax=breeds_count['count'].head(20), colors=Dark2_7.hex_colors)
plt.plot(breeds_count['count'].head(20), my_range, "o",color='#D3D3D3')
 
# Add titles and axis names
# plt.yticks(0, breeds_count['breedname'].head(20))
plt.yticks(my_range, breeds_count['breedname'].head(20))
plt.title("Most Popular Doggo Breed in NYC", loc='left')
plt.xlabel('Number of Doggos')
# plt.ylabel('Breed Name')
plt.grid(axis='y')
# plt.axis('off')

# Show the plot
plt.show()

Inferences:

  • Yorkshire Terrier, Shih Tzu and Chihuahua are some of the most popular breeds.

  • We can tell from the visualizations that there are a lot of Doggos whose breed is not determined (recorded as Unknown). Collecting more complete breed information would make the analysis more accurate, and improving the licensing process that produces this data could be an efficient way to get there.
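The share of undetermined breeds can be measured directly, and those rows can be set aside before ranking. A minimal sketch, assuming the placeholder label is exactly "Unknown" as in the sample rows shown earlier; the `toy` frame is hypothetical.

```python
import pandas as pd

# Hypothetical sample; breed labels follow the style of the data preview
toy = pd.DataFrame({"breedname": ["Unknown", "Boxer", "Unknown", "Maltese", "Basenji"]})

# The mean of a boolean mask is the fraction of rows where it is True
unknown_share = (toy["breedname"] == "Unknown").mean()

# Rank only the dogs whose breed was actually recorded
known_counts = toy.loc[toy["breedname"] != "Unknown", "breedname"].value_counts()

print(f"Share of dogs with undetermined breed: {unknown_share:.0%}")
print(known_counts)
```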

Visualization 3 - Map Chart

We would like to find the density of doggos across all of the NYC boroughs. We will use a map chart to visualize this.

In [87]:
#store zipcodes, borough and count in data frame
zipcode_mapping = pd.concat([pd.Series(data.groupby(['zipcode'])['zipcode'].count().index),
                             pd.Series(data.groupby(['zipcode'])['zipcode'].count().values)],axis=1)
zipcode_mapping = zipcode_mapping.merge(data[['zipcode','borough']],on='zipcode',how='inner')

#rename columns
zipcode_mapping = zipcode_mapping.rename(columns={'zipcode':'ZIPCODE',0:'Count','borough':'Borough'})

#change data type to match it with data type of ZIPCODE in GeoJSON file
zipcode_mapping = zipcode_mapping.astype({"ZIPCODE":int, "Count":int})
zipcode_mapping = zipcode_mapping.astype({'ZIPCODE' : 'string'})
zipcode_mapping.tail()
Out[87]:
ZIPCODE Count Borough
490077 11697 225 Queens
490078 11697 225 Queens
490079 11697 225 Queens
490080 11697 225 Queens
490081 11697 225 Queens
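The tail above repeats the same zip code row several times: merging the counts back onto `data[['zipcode','borough']]` yields one row per license (490,082 rows) rather than one row per zip code. A choropleth only needs one row per zip code, so the lookup can be de-duplicated first. The sketch below uses a hypothetical `toy` frame to show the idea.

```python
import pandas as pd

# Hypothetical sample: three licenses in one zip code, two in another
toy = pd.DataFrame({"zipcode": [11697, 11697, 11697, 10013, 10013],
                    "borough": ["Queens", "Queens", "Queens", "Manhattan", "Manhattan"]})

# One row per zip code with its license count
counts = toy.groupby("zipcode").size().reset_index(name="Count")

# One row per (zipcode, borough) pair, so the merge cannot multiply rows
lookup = toy[["zipcode", "borough"]].drop_duplicates()

zipcode_mapping = counts.merge(lookup, on="zipcode", how="inner")
print(zipcode_mapping)
```

The resulting frame carries the same information in far fewer rows, which also keeps the figure payload small.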
In [88]:
px.choropleth_mapbox
Out[88]:
<function plotly.express._chart_types.choropleth_mapbox(data_frame=None, geojson=None, featureidkey=None, locations=None, color=None, hover_name=None, hover_data=None, custom_data=None, animation_frame=None, animation_group=None, category_orders=None, labels=None, color_discrete_sequence=None, color_discrete_map=None, color_continuous_scale=None, range_color=None, color_continuous_midpoint=None, opacity=None, zoom=8, center=None, mapbox_style=None, title=None, template=None, width=None, height=None)>
In [89]:
#plot map using plotly express choropleth mapbox
fig = px.choropleth_mapbox(zipcode_mapping, geojson=r"input/zip_code_040114.geojson",
                    locations='ZIPCODE',
                    color='Count',
                    color_continuous_scale="Pinkyl",
                    range_color=(0, 10000),
                    mapbox_style="carto-positron",
                    opacity=0.5,
                    featureidkey="properties.ZIPCODE",
                    zoom=9.25, center = {"lat": 40.7128, "lon": -74.0060},
                    hover_name=zipcode_mapping['Borough'],
                    hover_data=['Count'],
                    title="Active Doggo Licenses"
                    )
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
# fig.update_geos(fitbounds="locations", visible=True,scope="usa",showsubunits=True,subunitcolor="Black")
fig.show()
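Plotly's documentation describes the `geojson` argument as a GeoJSON-formatted dict, so whether a plain file path works may depend on the environment. A more portable pattern is to load the file with `json` and pass the resulting dict. The sketch below writes a tiny made-up feature collection to a temporary file so that it runs on its own; the geometry and file name are hypothetical, and only the `properties.ZIPCODE` key mirrors the `featureidkey` used above.

```python
import json
import os
import tempfile

# Hypothetical one-feature collection standing in for the real zip code file
sample = {
    "type": "FeatureCollection",
    "features": [{
        "type": "Feature",
        "properties": {"ZIPCODE": "11697"},
        "geometry": {"type": "Polygon",
                     "coordinates": [[[0, 0], [0, 1], [1, 1], [0, 0]]]},
    }],
}
path = os.path.join(tempfile.mkdtemp(), "sample.geojson")
with open(path, "w") as f:
    json.dump(sample, f)

# Load the file into a dict; this dict is what would be passed as geojson=
with open(path) as f:
    zip_geojson = json.load(f)

# Collect the ids that featureidkey="properties.ZIPCODE" would match against
geo_zips = {feat["properties"]["ZIPCODE"] for feat in zip_geojson["features"]}
print(sorted(geo_zips))
```

Comparing `geo_zips` with the `ZIPCODE` column is a quick way to spot zip codes that would be left blank on the map because of a type or formatting mismatch.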

Inferences:

  • On the colour scale, Manhattan shows the darkest shades, which suggests it is the borough with the most doggos.

Visualization 4 - Bubble Chart

We are interested in finding which doggo names are more popular among the citizens of NYC.

In [90]:
#find the doggo names according to their count
names = pd.Series(data.groupby(data["animalname"])["animalname"].count())
names_count = pd.concat([pd.Series(names.index), pd.Series(names.values)],axis=1)
names_count = names_count.rename(columns={"animalname":"Name",0:"Count"}).sort_values(by="Count",ascending=False)
names_count = names_count.loc[(names_count["Name"] != "UNKNOWN")].loc[(names_count["Name"] != "NAME NOT PROVIDED")].head(30)
names_count
Out[90]:
Name Count
2121 BELLA 5506
16034 MAX 4746
4688 CHARLIE 3933
5639 COCO 3637
15086 LUNA 3621
14747 LOLA 3397
21642 ROCKY 3374
15002 LUCY 2890
14991 LUCKY 2679
25260 TEDDY 2666
6335 DAISY 2656
3611 BUDDY 2634
1514 BAILEY 2401
16724 MILO 2345
17795 NAME 2279
25795 TOBY 2183
20498 PRINCESS 2165
5109 CHLOE 1990
17067 MOLLY 1870
14357 LEO 1807
16534 MIA 1762
19542 PENNY 1716
18731 OLIVER 1696
18847 OREO 1688
5839 COOPER 1682
11560 JACK 1628
5825 COOKIE 1525
14560 LILY 1520
21983 RUBY 1505
20493 PRINCE 1489
In [91]:
import circlify
# compute circle positions:
circles = circlify.circlify(names_count['Count'].sample(frac=1).tolist(), 
                            show_enclosure=False, 
                            target_enclosure=circlify.Circle(x=0, y=0)
                           )
circles.reverse()

def get_color(name, number):
    pal = list(sns.color_palette(palette=name, n_colors=number).as_hex())
    return pal
In [96]:
fig, ax = plt.subplots(figsize=(17, 17), facecolor='white')
ax.axis('off')
lim = max(max(abs(circle.x)+circle.r, abs(circle.y)+circle.r,) for circle in circles)
plt.xlim(-lim, lim)
plt.ylim(-lim, lim)

# print circles
for circle, label, emi, color in zip(circles, names_count['Name'], names_count['Count'], get_color('Pastel2', 30)):
    x, y, r = circle
    ax.add_patch(plt.Circle((x, y), r, alpha=0.7, color = color))
    plt.annotate(label +'\n'+ format(emi, ","), (x,y), size=20, va='center', ha='center')
plt.xticks([])
plt.yticks([])
plt.text(0,1, 'Bella, Max, Charlie, Coco & Luna are the\nmost popular names of Doggos in NYC',
        fontsize = 20, bbox = dict(facecolor = 'red', alpha = 0.5),
        horizontalalignment='center',
        verticalalignment='center')
plt.show()

Inferences:

  • Some of the most popular doggo names are Bella, Max, Charlie, Luna and Coco.
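The top-30 table still contains the entry NAME (2,279 rows), which looks like a placeholder rather than a real dog name; that reading is an assumption, not something the dataset documents. Placeholders can be excluded in one step with `isin`. In the sketch below, the counts for BELLA, MAX and NAME come from the table above, while the counts for the other two placeholders are invented for illustration.

```python
import pandas as pd

names_count = pd.DataFrame({
    "Name": ["BELLA", "UNKNOWN", "NAME", "MAX", "NAME NOT PROVIDED"],
    # 5506, 2279 and 4746 are from the table above; 9000 and 8000 are made up
    "Count": [5506, 9000, 2279, 4746, 8000],
})

# Assumed list of placeholder values; extend it as more are discovered
placeholders = ["UNKNOWN", "NAME NOT PROVIDED", "NAME"]

top_names = (names_count[~names_count["Name"].isin(placeholders)]
             .sort_values(by="Count", ascending=False))
print(top_names)
```

Keeping the placeholders in a single list makes the filter easier to extend than a chain of `.loc` comparisons.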

---Bubble Chart Alternative---

In [50]:
#pre-defined function taken from the matplotlib documentation

#uncomment everything starting from next line

# class BubbleChart:
#     def __init__(self, area, bubble_spacing=0):
#         """
#         Setup for bubble collapse.

#         Parameters
#         ----------
#         area : array-like
#             Area of the bubbles.
#         bubble_spacing : float, default: 0
#             Minimal spacing between bubbles after collapsing.

#         Notes
#         -----
#         If "area" is sorted, the results might look weird.
#         """
#         area = np.asarray(area)
#         r = np.sqrt(area / np.pi)

#         self.bubble_spacing = bubble_spacing
#         self.bubbles = np.ones((len(area), 4))
#         self.bubbles[:, 2] = r
#         self.bubbles[:, 3] = area
#         self.maxstep = 2 * self.bubbles[:, 2].max() + self.bubble_spacing
#         self.step_dist = self.maxstep / 2

#         # calculate initial grid layout for bubbles
#         length = np.ceil(np.sqrt(len(self.bubbles)))
#         grid = np.arange(length) * self.maxstep
#         gx, gy = np.meshgrid(grid, grid)
#         self.bubbles[:, 0] = gx.flatten()[:len(self.bubbles)]
#         self.bubbles[:, 1] = gy.flatten()[:len(self.bubbles)]

#         self.com = self.center_of_mass()

#     def center_of_mass(self):
#         return np.average(
#             self.bubbles[:, :2], axis=0, weights=self.bubbles[:, 3]
#         )

#     def center_distance(self, bubble, bubbles):
#         return np.hypot(bubble[0] - bubbles[:, 0],
#                         bubble[1] - bubbles[:, 1])

#     def outline_distance(self, bubble, bubbles):
#         center_distance = self.center_distance(bubble, bubbles)
#         return center_distance - bubble[2] - \
#             bubbles[:, 2] - self.bubble_spacing

#     def check_collisions(self, bubble, bubbles):
#         distance = self.outline_distance(bubble, bubbles)
#         return len(distance[distance < 0])

#     def collides_with(self, bubble, bubbles):
#         distance = self.outline_distance(bubble, bubbles)
#         idx_min = np.argmin(distance)
#         return idx_min if type(idx_min) == np.ndarray else [idx_min]

#     def collapse(self, n_iterations=50):
#         """
#         Move bubbles to the center of mass.

#         Parameters
#         ----------
#         n_iterations : int, default: 50
#             Number of moves to perform.
#         """
#         for _i in range(n_iterations):
#             moves = 0
#             for i in range(len(self.bubbles)):
#                 rest_bub = np.delete(self.bubbles, i, 0)
#                 # try to move directly towards the center of mass
#                 # direction vector from bubble to the center of mass
#                 dir_vec = self.com - self.bubbles[i, :2]

#                 # shorten direction vector to have length of 1
#                 dir_vec = dir_vec / np.sqrt(dir_vec.dot(dir_vec))

#                 # calculate new bubble position
#                 new_point = self.bubbles[i, :2] + dir_vec * self.step_dist
#                 new_bubble = np.append(new_point, self.bubbles[i, 2:4])

#                 # check whether new bubble collides with other bubbles
#                 if not self.check_collisions(new_bubble, rest_bub):
#                     self.bubbles[i, :] = new_bubble
#                     self.com = self.center_of_mass()
#                     moves += 1
#                 else:
#                     # try to move around a bubble that you collide with
#                     # find colliding bubble
#                     for colliding in self.collides_with(new_bubble, rest_bub):
#                         # calculate direction vector
#                         dir_vec = rest_bub[colliding, :2] - self.bubbles[i, :2]
#                         dir_vec = dir_vec / np.sqrt(dir_vec.dot(dir_vec))
#                         # calculate orthogonal vector
#                         orth = np.array([dir_vec[1], -dir_vec[0]])
#                         # test which direction to go
#                         new_point1 = (self.bubbles[i, :2] + orth *
#                                       self.step_dist)
#                         new_point2 = (self.bubbles[i, :2] - orth *
#                                       self.step_dist)
#                         dist1 = self.center_distance(
#                             self.com, np.array([new_point1]))
#                         dist2 = self.center_distance(
#                             self.com, np.array([new_point2]))
#                         new_point = new_point1 if dist1 < dist2 else new_point2
#                         new_bubble = np.append(new_point, self.bubbles[i, 2:4])
#                         if not self.check_collisions(new_bubble, rest_bub):
#                             self.bubbles[i, :] = new_bubble
#                             self.com = self.center_of_mass()

#             if moves / len(self.bubbles) < 0.1:
#                 self.step_dist = self.step_dist / 2

#     def plot(self, ax, labels, colors):
#         """
#         Draw the bubble plot.

#         Parameters
#         ----------
#         ax : matplotlib.axes.Axes
#         labels : list
#             Labels of the bubbles.
#         colors : list
#             Colors of the bubbles.
#         """
#         for i in range(len(self.bubbles)):
#             circ = plt.Circle(
#                 self.bubbles[i, :2], self.bubbles[i, 2], color=colors[i])
#             ax.add_patch(circ)
#             ax.text(*self.bubbles[i, :2], labels[i],
#                     horizontalalignment='center', verticalalignment='center')
In [52]:
# bubble_chart = BubbleChart(area=names_count['Count'].values,
#                            bubble_spacing=0.1)
# bubble_chart.collapse()
In [53]:
# names_count["Colors"] = get_color('Pastel2', 30)
# # names_count.insert(2, "Colors", get_color('Pastel2', 20), True)
# names_count
In [54]:
# names_count['Name'].values
In [70]:
#
# fig, ax = plt.subplots(subplot_kw=dict(aspect="equal"))
# fig.set_size_inches(15, 15, forward=True)
# bubble_chart.plot(
#     ax, names_count['Name'].values, names_count['Colors'].values)
# ax.axis("off")
# ax.relim()
# ax.autoscale_view()
# plt.show()
In [ ]:
## Part II: Dog Bite Analysis